Template 3:
Welcome! This template will guide you through a Bayesian analysis in R, even if you have never done Bayesian analysis before. There is a set of templates, each for a different type of analysis. This template is for data with a single continuous independent variable and will produce a line chart. If your analysis includes a linear regression, this might be the right template for you.
This template assumes you have basic familiarity with R. Once completed, this template will produce a summary of the analysis, complete with parameter estimates and credible intervals, and two animated hypothetical outcome plots (HOPs; see Hullman, Resnick, Adar 2015 DOI: 10.1371/journal.pone.0142444 and Kale, Nguyen, Kay, and Hullman VIS 2018 for more information) for both your prior and posterior estimates.
This Bayesian analysis focuses on producing results in a form that is easily interpretable, even to nonexperts. The credible intervals produced by Bayesian analysis are the analogue of confidence intervals in traditional null hypothesis significance testing (NHST). A weakness of NHST confidence intervals is that they are easily misinterpreted. Many people naturally interpret an NHST 95% confidence interval to mean that there is a 95% chance that the true parameter value lies somewhere in that interval; in fact, it means that if the experiment were repeated many times, about 95% of the resulting confidence intervals would include the true parameter value. The Bayesian credible interval sidesteps this complication by providing the intuitive meaning: a 95% chance that the true parameter value lies somewhere in that interval. To further support intuitive interpretations of your results, this template also produces animated HOPs, a type of plot that is more effective than visualizations such as error bars in helping people make accurate judgments about probability distributions.
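The repeated-experiment reading of a confidence interval can be illustrated with a small simulation. The sketch below is not part of the template's analysis; the true mean, sample size, and number of replications are illustrative assumptions. It repeats a hypothetical experiment many times and counts how often the 95% confidence interval contains the true mean.

```r
# Simulate the frequentist reading of a 95% confidence interval:
# repeat a hypothetical experiment many times and count how often
# the interval contains the true mean.
set.seed(1)
true_mean <- 5          # assumed "true" parameter (illustrative)
n_experiments <- 1000   # number of hypothetical replications
n_per_experiment <- 30  # observations in each replication

covered <- replicate(n_experiments, {
  sample_data <- rnorm(n_per_experiment, mean = true_mean, sd = 1)
  ci <- t.test(sample_data)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true mean
coverage <- mean(covered)
print(coverage)
```

The printed proportion should land near 0.95. That number describes the long-run behavior of the procedure, not the probability that any single interval contains the true value.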
This set of templates supports a few types of statistical analysis. (In future work, this list of supported statistical analyses will be expanded.) For clarity, each type has been broken out into a separate template, so be sure to select the right template before you start! A productive way to choose which template to use is to think about what type of chart you would like to produce to summarize your data. Currently, the templates support the following:
One independent variable:
Categorical; bar graph (e.g. t-tests, one-way ANOVA)
Ordinal; line graph (e.g. t-tests, one-way ANOVA)
Continuous; line graph (e.g. linear regression)
Two independent variables:
Two categorical; bar graph (e.g. two-way ANOVA)
One categorical, one ordinal; line graph (e.g. two-way ANOVA)
One categorical, one continuous; line graph (e.g. linear regression with multiple lines)
Note that this template fits your data to a model that assumes normally distributed error terms. (This is the same assumption underlying t-tests, ANOVA, etc.) This template requires you to have already run diagnostics to determine that your data is consistent with this assumption; if you have not, the results may not be valid.
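One common way to run such diagnostics is to fit an ordinary least-squares line and inspect its residuals. The sketch below uses simulated data as a stand-in for your own x and y columns; the data-generating values are illustrative assumptions, and these two checks are only one of several reasonable approaches.

```r
set.seed(2)
# Hypothetical data standing in for your own x and y columns
example_data <- data.frame(x = runif(200, min = 12, max = 72))
example_data$y <- 4 + 0.01 * example_data$x + rnorm(200, sd = 1)

# Fit an ordinary least-squares line and extract the residuals
fit <- lm(y ~ x, data = example_data)
res <- residuals(fit)

# Visual check: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Formal check: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw$p.value)
```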
Once you have selected your template, follow along with it to complete the analysis. For each code chunk, you may need to make changes to customize the code for your own analysis. In those places, the code chunk will be preceded by a list of things you need to change (with the heading “What to change”), and each line that needs to be customized will also include the comment #CHANGE ME within the code chunk itself. You can run each code chunk independently during debugging; when you’re finished, you can knit the document to produce the complete document.
Good luck!
This template comes prefilled with an example dataset from Moser et al. (DOI: 10.1145/3025453.3025778), which examines choice overload in the context of e-commerce. The study examined the relationship between choice satisfaction (measured on a 7-point Likert scale), the number of product choices presented on a webpage, and whether the participant is a decision “maximizer” (a person who examines all options and tries to choose the best) or a “satisficer” (a person who selects the first option that is satisfactory). In this template, we analyze the relationship between the number of choices presented, which we treat as a continuous variable with values in the range [12,72], and choice satisfaction, which we treat as a continuous variable with values in the range [1,7].
If this is your first time using the template, you may need to install libraries. Uncomment the lines below - install.packages() and devtools::install_github() - to install the required packages. This only needs to be done once.
knitr::opts_chunk$set(fig.align="center")
# install.packages(c("rstanarm", "tidyverse", "tidybayes", "modelr", "devtools"))
# devtools::install_github("thomasp85/gganimate")
library(rstanarm) #bayesian analysis package
library(tidyverse) #tidy datascience commands
library(tidybayes) #tidy data + ggplot workflow
library(modelr) #tidy pipelines when modelling
library(ggplot2) #plotting package
library(gganimate) #animate ggplots
theme_set(theme_light()) # set the ggplot theme for all plots
What to change
mydata: Read in the file that contains your data.
mydata = read.csv('datasets/choc_cleaned_data.csv') #CHANGE ME 1
We’ll fit the following model: stan_glm(y ~ x), which specifies a linear regression where each \(y_i\) is drawn from a normal distribution with mean equal to \(a + bx_i\) and standard deviation equal to sigma (\(\sigma\)):
\[ y_i \sim Normal(a + bx_i, \sigma) \]
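To make the model statement concrete, the sketch below simulates data from it in base R and then recovers the parameters with an ordinary least-squares fit. The values chosen for a, b, and sigma are illustrative assumptions, not estimates from the example dataset.

```r
set.seed(3)
# Illustrative parameter values (assumptions, not fitted estimates)
a <- 4        # intercept
b <- 0.02     # slope
sigma <- 1    # standard deviation of the error term

# Each y_i is drawn from Normal(a + b * x_i, sigma)
x <- seq(12, 72, length.out = 500)
y <- rnorm(length(x), mean = a + b * x, sd = sigma)

# A least-squares fit should land close to a and b
estimates <- coef(lm(y ~ x))
print(estimates)
```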
Choose your independent and dependent variables. These are the variables that will correspond to the x and y axis on the final plots.
What to change
mydata$x: Select which variables will appear on the x-axis of your plots.
mydata$y: Select which variables will appear on the y-axis of your plots.
x_lab: Label your plots’ x-axes.
y_lab: Label your plots’ y-axes.
#select your independent and dependent variable
mydata$x = mydata$num_products_displayed #CHANGE ME 2
mydata$y = mydata$satis_Q1 #CHANGE ME 3
# label the axes on the plots
x_lab = "Choices" #CHANGE ME 4
y_lab = "Satisfaction" #CHANGE ME 5
In this section, you will set priors for your model. Setting priors thoughtfully is important to any Bayesian analysis, especially if you will fit your model to a small sample of data. The priors express your best prior belief, before seeing any data, of reasonable values for the model parameters. The model estimation process produces a posterior distribution: beliefs about the plausible values for parameters, given the data in your dataset.
Ideally, you will have previous literature from which to draw these prior beliefs. If no previous studies exist, you can instead assign “weakly informative priors” that only minimally restrict the model; for example, a weakly informative prior for a parameter that can only have values between 1 and 7 would assign a very small probability to values outside of that range. We have provided an example of how to set priors below.
To check the plausibility of your priors, use the code section after this one to generate a graph of 100 plausible fit lines from your priors to check if the values generated are reasonable.
Our model has the following parameters:
a. the y-value at the intercept
b. the slope of the line
c. the standard deviation of the normally distributed error term
To simplify things, we only let you specify beliefs about these parameters in the form of a normal distribution. Thus, you will specify what you think is the most likely value for the parameter (the mean), and a standard deviation. In doing so, you express a belief that you are roughly 95% certain (before looking at any data) that the true value of the parameter is within two standard deviations of the mean.
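The two-standard-deviation rule can be checked directly with pnorm(). The sketch below uses the example intercept prior from this template (mean 4, standard deviation 1.5) and computes how much probability the prior places inside and outside the possible range [1, 7].

```r
# Example intercept prior from this template
a_prior <- 4
a_sd <- 1.5

# Probability mass within two standard deviations of the mean (about 0.954)
within_2sd <- pnorm(2) - pnorm(-2)

# The corresponding interval on the scale of the parameter
interval <- a_prior + c(-2, 2) * a_sd

# Probability the prior assigns to values outside [1, 7]
outside <- pnorm(1, mean = a_prior, sd = a_sd) +
  (1 - pnorm(7, mean = a_prior, sd = a_sd))

print(within_2sd)
print(interval)
print(outside)
```

With these values the interval is exactly [1, 7], and just under 5% of the prior mass falls outside it.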
Finally, our modeling system, stan_glm(), will automatically set a prior for the last parameter, the standard deviation of the normally distributed error term for the model overall. That is almost never a model parameter that you will be interpreting, and stan_glm() does a reasonable job of assigning a weakly informative prior for that parameter, in a way that should have little impact on the estimation of the other parameters, which we normally do care about.
What to change
a_prior: Select the intercept mean.
a_sd: Select the intercept standard deviation.
b1_prior: Select the slope mean.
b1_sd: Select the slope standard deviation.
You should also change the comments in the code below to explain your choice of priors.
# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (10)
# In our example dataset, y-axis scores can be in the range [1, 7].
# Thus, a mean value in the control condition of less than 1 or greater
# than 7 is impossible. With a normal distribution, we can't completely
# rule out those impossible values, but we choose a mean and sd that assign
# less than 5% probability to those impossible values.
# We select the mean of the range [1,7], and an sd that assigns a
# 95% probability to values that vary up to +3/-3 from this mean.
a_prior = 4 # CHANGE ME 6
a_sd = 1.5 # CHANGE ME 7
# CHANGE THIS COMMENT EXPLAINING YOUR CHOICE OF PRIORS (10)
# In this example, we will say we do not have a strong hypothesis about the effect
# of choice set size on satisfaction, so we set the mean of the effect size
# parameters to be 0. In the absence of other information, we set the sd so that
# a change from the minimum choice set size (12) to the maximum choice set size (72)
# could plausibly result in a +6/-6 change in satisfaction, the maximum possible change.
b1_prior = 0 # CHANGE ME 8
b1_sd = (6/(72-12))/2 # CHANGE ME 9
Next, you’ll want to check your priors by running this code chunk. It will produce 100 sample draws from the priors you set in the previous section, so you can check to see if the values generated are reasonable. (We’ll go into the details of this code later.)
What to change
Nothing! Just run this code to check your priors, adjusting prior values above as needed until you find reasonable prior values. Note that you may get a couple of very implausible or even impossible values because our assumption of normally distributed priors assigns a small probability to even very extreme values. If you are concerned by the outcome, you can try rerunning it a few more times to make sure that any implausible values you see don’t come up very often.
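For intuition about what this check does, the sketch below approximates it in base R without running Stan: it draws 100 intercepts and slopes from the example priors and reports the share of lines that leave the possible range [1, 7] somewhere between x = 12 and x = 72. It rests on two assumptions. First, as we understand the rstanarm documentation, the intercept prior applies to the expected outcome when the predictors are centered. Second, the centering point x_center = 42 is the midpoint of the range and only a stand-in for the actual mean of x in your data.

```r
set.seed(12345)
# Example priors from this template
a_prior <- 4
a_sd <- 1.5
b1_prior <- 0
b1_sd <- (6 / (72 - 12)) / 2

n_lines <- 100
x_grid <- seq(12, 72, length.out = 101)
x_center <- 42  # assumption: stand-in for mean(mydata$x)

# Draw one intercept and one slope per line from the normal priors
intercepts <- rnorm(n_lines, mean = a_prior, sd = a_sd)
slopes <- rnorm(n_lines, mean = b1_prior, sd = b1_sd)

# Rows are grid points, columns are prior fit lines
prior_lines <- sweep(outer(x_grid - x_center, slopes), 2, intercepts, "+")

# Plot all lines with transparency, similar to the template's prior check
matplot(x_grid, prior_lines, type = "l", lty = 1,
        col = rgb(0, 0, 0, 0.2), xlab = "x", ylab = "y")

# Share of lines that leave the possible range [1, 7] at some point
leaves_range <- apply(prior_lines, 2, function(line) any(line < 1 | line > 7))
prop_outside <- mean(leaves_range)
print(prop_outside)
```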
#generate sample draws from the priors
m_prior = stan_glm(y ~ x, data = mydata,
  prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
  prior = normal(b1_prior, b1_sd, autoscale = FALSE),
  prior_PD = TRUE
)
#create the dataframe for fitted draws & create the plot
mydata %>%
  data_grid(x = seq_range(x, n = 101)) %>%
  add_fitted_draws(m_prior, n = 100, seed = 12345) %>%
  ggplot(aes(x = x, y = .value)) +
  geom_line(aes(group = .draw), alpha = .2) +
  labs(x=x_lab, y=y_lab) # axes labels
There’s nothing you have to change here. Just run the model.
In rare cases, you may get a warning that the Markov chains have failed to converge. Chains that fail to converge are often a sign that your model is poorly suited to the data or that your priors conflict with it. If you get this warning, you should adjust your priors: your prior distribution may be too narrow, or your prior mean may be very far from the data.
m = stan_glm(y ~ x, data = mydata,
  prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
  prior = normal(b1_prior, b1_sd, autoscale = FALSE)
)
Here is a summary of the model fit.
The summary reports diagnostic values that can help you evaluate whether the model estimation worked. For this template, we can keep diagnostics simple: check that your Rhat values are very close to 1.0. Larger values mean that the Markov chains have not mixed well, so the estimates may be unreliable. This is usually only a problem if the Rhat values are greater than 1.1, which is a warning sign that the Markov chains have failed to converge. If this happens, Stan will warn you about the failure, and you should adjust your priors.
summary(m, digits=3)
##
## Model Info:
##
## function: stan_glm
## family: gaussian [identity]
## formula: y ~ x
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## observations: 611
## predictors: 2
##
## Estimates:
##                   mean      sd     2.5%      25%      50%      75%    97.5%
## (Intercept)      6.167   0.098    5.980    6.099    6.166    6.233    6.358
## x               -0.001   0.002   -0.005   -0.002   -0.001    0.001    0.003
## sigma            1.027   0.030    0.970    1.006    1.026    1.047    1.087
## mean_PPD         6.134   0.059    6.018    6.093    6.133    6.174    6.252
## log-posterior -887.325   1.258 -890.630 -887.884 -886.992 -886.417 -885.907
##
## Diagnostics:
##                mcse   Rhat  n_eff
## (Intercept)   0.002  1.000   4000
## x             0.000  1.000   4000
## sigma         0.000  0.999   3591
## mean_PPD      0.001  1.000   4000
## log-posterior 0.028  1.001   1954
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
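To see why Rhat values near 1 indicate agreement between chains, the sketch below computes the classic (non-split) potential scale reduction factor by hand on simulated chains. This is a simplified textbook version and an assumption on our part about what is useful to show; the value Stan reports is computed on split chains and will differ in detail. The function name rhat_basic is hypothetical.

```r
set.seed(4)
# Classic (non-split) potential scale reduction factor.
# chains: numeric matrix with one row per iteration and one column per chain.
rhat_basic <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # average within-chain variance
  B <- n * var(colMeans(chains))      # between-chain variance
  var_hat <- (n - 1) / n * W + B / n  # pooled estimate of the posterior variance
  sqrt(var_hat / W)
}

# Four simulated chains that all sample the same distribution
good_chains <- matrix(rnorm(4000), ncol = 4)

# The same chains, with one chain shifted to a different region
bad_chains <- good_chains
bad_chains[, 1] <- bad_chains[, 1] + 3

rhat_good <- rhat_basic(good_chains)
rhat_bad <- rhat_basic(bad_chains)
print(c(rhat_good, rhat_bad))
```

When the chains agree, the between-chain variance adds almost nothing and the ratio stays close to 1. When one chain sits in a different region, the ratio rises well above the 1.1 threshold mentioned above.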
Given this model, we might want to plot the fit line with credible bands around it. To do that, we will first construct a fit grid: a data frame of points at which we want to calculate the value of the fit line from the model. The data_grid function allows us to do this easily, e.g. by asking for 20 equally spaced points along the value of the x variable in our original data:
mydata %>%
  data_grid(x = seq_range(x, n = 20))
## # A tibble: 20 x 1
## x
## <dbl>
## 1 12
## 2 15.2
## 3 18.3
## 4 21.5
## 5 24.6
## 6 27.8
## 7 30.9
## 8 34.1
## 9 37.3
## 10 40.4
## 11 43.6
## 12 46.7
## 13 49.9
## 14 53.1
## 15 56.2
## 16 59.4
## 17 62.5
## 18 65.7
## 19 68.8
## 20 72
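The grid above can be reproduced in base R. The sketch below assumes that seq_range(x, n = 20), with its default arguments, amounts to 20 equally spaced points between the minimum and maximum of x, and it hard-codes the example dataset's range of 12 to 72.

```r
# Range of x in the example dataset (assumed here rather than read from file)
x_min <- 12
x_max <- 72

# 20 equally spaced points from the minimum to the maximum of x
grid_x <- seq(x_min, x_max, length.out = 20)
print(round(grid_x, 1))
```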
Given this grid, we can then draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function, and then summarize these samples in ggplot using a stat_lineribbon:
mydata %>%
  data_grid(x = seq_range(x, n = 20)) %>%
  add_fitted_draws(m) %>%
  ggplot(aes(x = x, y = .value)) +
  stat_lineribbon() +
  coord_cartesian(ylim = c(min(mydata$y, na.rm=TRUE), max(mydata$y, na.rm=TRUE))) + # sets axis limits - CHANGE ME (optional)
  scale_fill_brewer()
But what we really want is to display a selection of plausible fit lines, say 100 of them. To do that, we instead ask add_fitted_draws for only 100 draws, which we plot separately as lines:
mydata %>%
  data_grid(x = seq_range(x, n = 101)) %>%
  # the seed argument is for reproducibility: it ensures the pseudo-random
  # number generator used to pick draws has the same seed on every run,
  # so that someone else can re-run this code and verify their output matches
  add_fitted_draws(m, n = 100, seed = 12345) %>%
  ggplot(aes(x = x, y = .value)) +
  coord_cartesian(ylim = c(min(mydata$y, na.rm=TRUE), max(mydata$y, na.rm=TRUE))) + # sets axis limits - CHANGE ME (optional)
  geom_line(aes(group = .draw), alpha = .2)
Even better would be to animate this graph using HOPs (Hypothetical Outcome Plots), a type of plot that visualizes uncertainty as sets of draws from a distribution. HOPs have been demonstrated to improve multivariate probability estimates (Hullman et al. 2015) and to increase sensitivity to the underlying trend in data (Kale et al. 2018) compared with static representations of uncertainty like error bars.
To set up the HOPs, we will first set the aesthetics for the ggplot that we will use.
You can set the aesthetics of your HOPs here.
What to change
In most cases, the default values here should be just fine. If you want to adjust the aesthetics of the animated plots later, you can do so here using the ggplot2 package; just be sure to keep the lines that are commented with “do not change.” Below are two optional code customizations that we think may be particularly useful for some datasets.
# the default code for the plots - if needed, the animated plot aesthetics can be customized here
graph_plot = function(data) {
  ggplot(data, aes(x = x, y = .value)) + #do not change
    geom_line() + #do not change
    transition_states(.draw, transition_length = 1, state_length = 1) + # gganimate code to animate the plots. Do not change
    coord_cartesian(ylim = c(min(mydata$y, na.rm=TRUE), max(mydata$y, na.rm=TRUE))) + # sets axis limits - CHANGE ME 11 (optional)
    labs(x=x_lab, y=y_lab) # axes labels
}
# Animation parameters
n_draws = 100 # the number of draws to visualize in the HOPs
frames_per_second = 2.5 # the speed of the HOPs
# 2.5 frames per second (400ms) is the recommended speed for the HOPs visualization.
# Faster speeds (100ms) have been demonstrated to not work as well.
# See Kale et al. VIS 2018 for more info.
Now that the plot aesthetics are set, we can return to our fit grid and repeatedly draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function. Each frame of the animation shows a different draw from the posterior:
p = mydata %>%
  data_grid(x = seq_range(x, n = 101)) %>%
  add_fitted_draws(m, n = n_draws, seed = 12345)
# the seed argument is for reproducibility: it ensures the pseudo-random
# number generator used to pick draws has the same seed on every run,
# so that someone else can re-run this code and verify their output matches
animate(graph_plot(p), nframes = n_draws * 2, fps = frames_per_second)
For more context, we could also show the fit lines with the data:
animate(graph_plot(p) + geom_count(aes(y = y), data = mydata), nframes = n_draws * 2, fps = frames_per_second)
We already looked at some sample plots of the priors when we were setting priors; now we want to look at these priors again, but in a HOPs format so we can compare to the posterior plots. To get the prior plots, we can simply ask stan_glm to sample from the prior.
What to change
If you are knitting this document, or if you already ran the code in the “Check priors” section that calculates m_prior, you can leave this line commented out; otherwise, uncomment it:
#m_prior = update(m, prior_PD = TRUE)
Then our code to generate plots is identical, except we replace m with m_prior:
p_prior = mydata %>%
  data_grid(x = seq_range(x, n = 101)) %>%
  add_fitted_draws(m_prior, n = n_draws, seed = 12345)
animate(graph_plot(p_prior), nframes = n_draws * 2, fps = frames_per_second)
Again, with context:
animate(graph_plot(p_prior) + geom_count(aes(y = y), data = mydata), nframes = n_draws * 2, fps = frames_per_second)
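The animation settings used throughout this template determine how long each rendered frame is shown and how long one pass of the animation lasts. The sketch below works through that arithmetic with the template's default values. How gganimate divides the frames between states and transitions is not modelled here.

```r
# Default animation parameters from this template
n_draws <- 100
frames_per_second <- 2.5
nframes <- n_draws * 2  # value passed to animate()

# Time each rendered frame stays on screen, in milliseconds
ms_per_frame <- 1000 / frames_per_second

# Length of one full pass through the animation, in seconds
duration_seconds <- nframes / frames_per_second

print(ms_per_frame)
print(duration_seconds)
```

With the defaults, each frame is shown for 400 ms and one pass lasts 80 seconds. If you change n_draws or frames_per_second, the same two lines give the new timing.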